Quality Indicators for Group Experimental and Quasi-experimental Research in Special Education

Authors

  • Russell Gersten
  • Lynn S. Fuchs
  • Donald Compton
  • Michael Coyne
  • Charles Greenwood
  • Mark S. Innocenti
Abstract

This article presents quality indicators for experimental and quasi-experimental studies in special education. These indicators are intended not only to evaluate the merits of a completed research report or article but also to serve as an organizer of critical issues for consideration in research. We believe these indicators can be used widely, from assisting in the development of research plans to evaluating proposals. In this paper, the framework and rationale are explained through brief descriptions of each indicator. Finally, we suggest a standard for determining whether a practice may be considered evidence-based. It is our intent that this standard for evidence-based practice and the indicators be reviewed, revised as needed, and adopted by the field of special education.

The purpose of this article is to present a set of quality indicators for experimental and quasi-experimental studies for the field of special education. We believe there is a critical need for such a set of indicators, given the current focus on the need for an increase in rigorous, scientific research in education. Recently, the National Research Council (2002), in a report on scientific research in education, noted that they saw no reason why education could not be subject to the same scientific methods as other disciplines such as chemistry or physics. They further argued that there is a place for all research methodologies in educational research: survey research, qualitative research, and correlational research. As Feuer, Towne, and Shavelson (2002) noted, “If a research conjecture or hypothesis can withstand scrutiny by multiple methods, its credibility is enhanced greatly. Overzealous adherence to the use of any given research design flies in the face of this fundamental principle” (p. 8). Yet the NRC report is unequivocal in stating numerous times that experiments using randomized trials are currently underutilized in educational research, despite being “the single best methodological route to ferreting out systematic relations between actions and outcome” (Feuer et al., 2002). In that sense, they reiterate a point made over a generation ago by Campbell and Stanley (1966), who highlighted that controlled experimental research conducted in field settings was “the...only means for settling disputes regarding educational practice, as the only way of verifying educational improvements, and as the only way of establishing a cumulative tradition in which improvement can be introduced without the danger of a faddish discard of old wisdom in favor of inferior novelties” (p. 2).

Until recently, there was no specific set of standards or quality indicators for evaluating either proposed or completed experimental or quasi-experimental intervention research. Few of the numerous educational research textbooks provide a clear list of criteria that fit the realities of contemporary field research, although they present excellent ideas (e.g., Boruch, 1997; Gall, Borg, & Gall, 2002; Shadish, Cook, & Campbell, 2002). Fortuitously, slightly before we began this endeavor, the What Works Clearinghouse of the U.S. Department of Education released the Study Design and Implementation Assessment Device (DIAD) (Valentine & Cooper, 2003). Its major goal is to evaluate whether a research article or research report can be considered sufficiently valid and reliable to be entered into a research synthesis on the effectiveness of an intervention or approach.
Our goal is a bit broader. We intend these quality indicators to be used not only to evaluate the merits of a completed research report or article but also to evaluate research proposals submitted to funding agencies or university dissertation committees. We also intend this framework to be useful to researchers as they think through the design of a study, serving as a checklist or organizer of critical issues for consideration. We intentionally chose to develop quality indicators, rather than a set of “yes/no” standards, because evaluating a research design always entails weighing the relative strengths and weaknesses in a set of domains. Our goal is not to provide a basic primer on how to design a high quality field research study. There are numerous introductory (e.g., Gall et al., 2002) and advanced texts (e.g., Shadish et al., 2002) widely available to explain the many concepts underlying the design of experimental studies. Our hope is that our set of indicators will be field tested and refined, and then considered useful by journal editors and reviewers of Federal grant proposals. We also envision this set of indicators assisting researchers as they design studies and guiding practitioners as they consider alternative educational practices for adoption in their classrooms and schools.

Quality indicators for group experimental and quasi-experimental research proposals in special education are presented in Table 1. Table 2 presents a set of quality indicators for evaluating the quality of completed experimental or quasi-experimental studies. These tables address similar issues but serve slightly different purposes. Both tables make a distinction between indicators that are considered “essential” for quality and indicators that are considered “desirable” to have in a study. We suggest that these indicators be used to define “acceptable” and “high” quality research proposals and studies. To be considered of acceptable quality, a research proposal or study would need to meet all but one of the “Essential Quality Indicators” and demonstrate at least one of the quality indicators listed as “Desirable” in Tables 1 and 2. To be considered high quality, a proposal or study would need to meet all but one of the “Essential Quality Indicators” and demonstrate at least four of the quality indicators listed as “Desirable.” These definitions of “acceptable” and “high quality” are tentative and should be field tested by universities, agencies that review grant applications, and research organizations. In the following section, we “walk” the reader through the framework by providing brief descriptions of the indicators. Finally, we suggest a standard for determining whether a specific practice may be considered evidence-based.

Quality Indicators for Group Experimental and Quasi-experimental Research

Conceptualization of the Research Study

Because a research study without context will have no impact, it is critical that the researcher clearly present the conceptualization and design of a study. In this section, we describe four indicators for providing context for a research study.

The conceptualization of the research study is based on the findings of rigorously designed studies that reflect the current scope of extant knowledge, including the findings of seminal studies. OR, if an innovative approach is proposed, it is based on sound conceptualization and is rooted in sound research. The literature review is a key section for the conceptual design of a study.
Typically, it describes prior research that leads to current conceptualizations and study design. It is important that the review presents the existing information and makes the case for the proposed research. The review of literature should reflect both recent and seminal research studies in the area, making the reader aware of how these studies relate to the proposed research study. If there is no recent research, the researcher should state this clearly. Researchers should ensure that the review of literature not only is adequate in the breadth of studies covered but also is focused on the small set of issues that are critical for this particular study. The end result is that the researcher should present a concise but complete summary of the scientific knowledge base for the area in which the problem exists, should illuminate areas of consensus and areas that require further investigation, and should create a theoretical and conceptual understanding of why the study addresses an important topic that has not been fully addressed in the past. Whether the researcher proposes an innovative approach for which little empirical evidence exists, or interventions that are subtle variants on evidence-based practices, he or she should focus the review of literature on providing a compelling argument for the approach, one that sets the stage for the guiding research questions.

A compelling case for the importance of the research is made. Feuer and colleagues (2002) make a strong case for an increased emphasis on presenting arguments regarding the importance of any research study when writing or discussing research. Part of their argument focuses on the increased salience of educational research in the current political climate. The question of how researchers choose the topics on which they focus their research is important, and the rationale for these choices must be conveyed to consumers or funders of research projects. Fabes, Martin, Hanish, and Updegraff (2000) recently discussed the issue of evaluating the significance of developmental research for the 21st century. These authors identify four new types of “validity” which, with some adaptation, can be applied to educational research. These types of validity are incidence validity, impact validity, sympathetic validity, and salience validity. Incidence validity refers to the degree to which a particular piece of research addresses a topic that significantly affects large numbers of people. Impact validity is the degree to which a research topic is perceived to have serious and enduring consequences. Sympathetic validity reflects the tendency to judge the significance of the research based on the degree to which it generates feelings of sympathy for individuals affected by the problem under study. Salience validity reflects the degree to which people, generally referring to the public, are aware of the problem or topic. Fabes and colleagues note that it is difficult for one study to incorporate all four types of validity, but that it is important for researchers to be aware of these types of validity and to look at the pattern of validity addressed by the study in question. These seem helpful concepts to use in considering this quality indicator.

Valid arguments supporting the proposed intervention as well as the nature of the comparison group are presented. It is important to characterize the intervention and place it within a context that makes sense to the reader.
The rationale for the proposed intervention should be sound and based, in part, on facts and concepts presented in the review of the literature. If the intervention is applied to new participants, settings, or contexts, there should be clear links, based on argument and/or research, for this new use. In addition to describing the intervention, it is important to describe the nature of the experience or services to be received by the comparison group (Feuer et al., 2002; Gersten, Baker, & Lloyd, 2000). A variety of reasons may exist for establishing comparison groups. The most common is comparing an innovative approach to a traditional one. In this case, readers need to know the nature of the experiences of participants in the more traditional comparison group. At times, researchers contrast theoretically opposed or divergent interventions. This, too, is legitimate, especially when the contrast is focused and of high interest. There are occasions when a researcher is broaching a new topic and simply wants to know if it leads to change. These would be the only instances where a no-treatment comparison group would be appropriate. Designs that are more elegant include multiple groups, some involving subtle contrasts (e.g., taught by an interventionist or teacher vs. a paraprofessional; identical intervention with or without progress monitoring).

The research questions are appropriate for the purpose of the study and are stated clearly. In most research proposals, a succinct statement of the key questions addressed is critical. These questions should be linked in an integral fashion to the purpose statement. The criteria for the research questions are simple and serve to pull the conceptualization and design of the study together. Hypothesized results should be specified, but the authors of a proposal should be candid about issues for which they have no specific prediction, and reviewers need to be aware that it is equally appropriate to frame clear research questions without a specified hypothesis.

Describing Participants

Sufficient information to determine/confirm whether the participants demonstrated the disability(ies) or difficulties addressed is presented. A fundamental issue for generalizing study results is whether the participants actually experienced the disability(ies) or the difficulties the study was intended to address. To resolve this issue, researchers must move beyond labels provided by school districts. Informing readers that students met the state criteria for identifying students as developmentally delayed provides little information on the type or extent of disability a student may have. Researchers need to provide a definition of the relevant disability(ies) or difficulties and then include assessment results documenting that the individuals included in the study met the requirements of the definition. For example, if the study addressed students with math difficulty, the researchers would (a) define math difficulty (e.g., performing below the 16th percentile on a test of mathematics computation and below the 16th percentile on a test of mathematics concepts) and (b) document that each participant included in the study met these criteria.
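To make the documentation step in (b) concrete, the following is a minimal sketch, assuming hypothetical percentile scores are available for each participant; the column names, cutoff, and example data are our illustration rather than anything reported in the article.

```python
# A minimal sketch (not from the article) of documenting that each participant
# met an operational definition of math difficulty: scoring below the 16th
# percentile on BOTH a computation and a concepts measure. All values are hypothetical.
import pandas as pd

participants = pd.DataFrame({
    "student_id":             ["s01", "s02", "s03", "s04"],
    "computation_percentile": [9, 14, 22, 11],
    "concepts_percentile":    [12, 30, 10, 15],
})

CUTOFF = 16  # operational criterion: below the 16th percentile on both measures
meets_criteria = (
    (participants["computation_percentile"] < CUTOFF)
    & (participants["concepts_percentile"] < CUTOFF)
)

# Per-participant documentation of eligibility, as the indicator requires.
print(participants.assign(meets_math_difficulty_criteria=meets_criteria))
```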
We recommend that researchers attempt to link their definitions to those in the current literature, either by replicating the definition used in prior published research or by explaining the reason for extending or redefining the operational definition and providing a new label for this alternative definition (to increase the specificity of terms in research). Researchers should provide enough information about participants so that readers can identify the population of participants to which results may be generalized. This additional information may include, but is not limited to, comorbid disability status (e.g., what percentage of students with reading disabilities also had math disability and/or attention deficit hyperactivity disorder?), demographics (e.g., age, race, sex, subsidized lunch status, English language learner status, special education status), scores on related academic assessments (with effect sizes also desirable), and the percentage of students receiving subsidized lunch in participating schools.

As part of describing participants, group differences on salient variables must also be presented. Given the issues discussed above on participant selection, comparability of groups on key demographic variables must be examined, both to describe the participants and for use in later analyses. It is also the researchers’ responsibility to document sample comparability at pretest on at least one outcome measure (or key predictor of outcome). Demonstrating such comparability helps rule out the possibility that study effects accrued due to pre-existing differences between the study groups on key performance or demographic variables.

Appropriate procedures are used to increase the probability that participants were comparable across conditions. The optimal method for assigning participants to study conditions is random assignment, although in some situations this is impossible. It is then the researchers’ responsibility to describe how participants were assigned to study conditions (convenience, similar classrooms, preschool programs in comparable districts, etc.). Researchers are urged, however, to attempt random assignment. In our experience, some type of random assignment can often be negotiated with districts or school or clinic personnel. The quality of the research design is invariably higher with random assignment of participants and intervention providers. In working with school districts, district procedures sometimes do not allow random assignment of students (participants) but will allow random assignment of classrooms (in many cases, the intervention provider). This situation is preferable to no random assignment. However, this method of random assignment has implications for design and analysis.
For example, if the intervention is at the child level (e.g., peer tutoring), this method of random assignment may be acceptable. However, if the intervention is a curriculum delivered by the classroom teacher and entire classrooms are assigned to intervention conditions, children cannot be considered randomly assigned. The impact of these decisions must be considered and described by the researcher. It is important to note that random assignment of participants to study conditions does not guarantee equivalent study groups on key characteristics of the participants. In fact, given the heterogeneity of many disability subgroups, it can be difficult to achieve initial comparability on key variables, with or without random assignment. For this reason, matching participants on a salient variable (or variables) and randomly assigning one member of each pair to a treatment condition, or using a stratified random assignment procedure, is often preferred (a minimal sketch of matched-pair assignment appears at the end of this section).

Differential attrition among intervention groups or severe overall attrition is documented. Researchers should document overall attrition of participants and ensure that attrition rates between the intervention and comparison groups were not substantially different. The reason for this is to document that study groups remain comparable at the conclusion of the study.

Sufficient information describing important characteristics of the intervention providers is provided, and appropriate procedures to increase the probability that intervention providers were comparable across conditions are used. Researchers should provide enough information about the intervention providers (i.e., teachers or other individuals responsible for implementation) so that readers understand the type of individual who may be capable of administering the intervention when it is used outside the context of that study (i.e., the intervention providers to which results may be generalized). Relevant characteristics may include age, sex, race, educational background, prior experience with related interventions, professional experience, and number of children with and without disabilities in the family (for parents). It may also be useful to present information from assessments of knowledge of the topic to be taught, knowledge of appropriate pedagogical methods, efficacy measures, and/or attitudinal measures.

The researchers must describe how intervention providers were assigned to the various study conditions. Random assignment is the preferred method, but other approaches often are necessary due to the circumstances in which the study is conducted. Researchers should also consider situations where intervention providers can be counterbalanced across conditions. This can be done by having each interventionist teach one group with one method and another group with another method, or by switching interventionists across conditions at the midpoint. The burden is on the researchers to describe assignment methods and, when random assignment or counterbalancing is not employed, to provide a rationale for how those methods support the integrity of the study. Comparability of interventionists across conditions should be documented. Regardless of whether random assignment was employed, researchers must quantify relevant background variables for the intervention providers associated with each of the study conditions. This is especially true if the interventionists are not counterbalanced across conditions. With such information, readers can be assured that the effects of the intervention are not due to pre-existing differences between the intervention providers (i.e., the favorable effects for an intervention are in fact attributable to the intervention).
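As referenced earlier in this section, matched-pair random assignment can be a practical compromise when simple randomization is unlikely to yield comparable groups. The following minimal sketch is our illustration, not a procedure from the article: participants are ranked on a single hypothetical pretest score, split into adjacent pairs, and one member of each pair is assigned to treatment by a coin flip.

```python
# A minimal sketch (our assumption, not the article's procedure) of
# matched-pair random assignment on a salient variable (a hypothetical pretest score).
import random

participants = {  # hypothetical participant ids and pretest scores
    "s01": 34, "s02": 51, "s03": 29, "s04": 47, "s05": 40, "s06": 55,
}

random.seed(42)  # record the seed so the assignment is reproducible
ranked = sorted(participants, key=participants.get)  # rank by pretest score

treatment, comparison = [], []
for i in range(0, len(ranked) - 1, 2):
    pair = [ranked[i], ranked[i + 1]]
    random.shuffle(pair)          # coin flip within the matched pair
    treatment.append(pair[0])
    comparison.append(pair[1])

print("treatment:", treatment)
print("comparison:", comparison)
```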
Implementation of the Intervention and Description of the Nature of Services in Comparison Conditions

The intervention is clearly described and specified. Researchers should provide a precise description of the independent variable to allow for systematic replication as well as to facilitate coding in research syntheses such as meta-analysis (Gersten, Baker, & Lloyd, 2000). Unfortunately, the instructional labels that are often assigned to interventions are vague or misleading and may vary considerably from study to study. For example, in a meta-analysis on the impact of various instructional approaches on students with learning disabilities, Swanson and Hoskyn (1998) found that there was significant overlap in the way constructs with similar labels were operationalized. They concluded that classifying types of teaching in broad terms such as direct instruction and strategy instruction was problematic and that more fine-grained terminology and descriptions needed to be used. Interventions should be clearly described on a number of salient dimensions, including conceptual underpinnings, detailed instructional procedures, teacher actions and language (e.g., modeling, corrective feedback), use of instructional materials (e.g., task difficulty, example selection), and student behaviors (e.g., what students are required to do and say). Precision in operationalizing the independent variable has become increasingly important as replication studies become more refined and synthesis procedures such as meta-analysis become more common. In fact, the purpose of research syntheses is to “discover the consistencies and account for the variability in similar-appearing studies” (Cooper & Hedges, 1994, p. 4). When analyzing similarities and differences among instructional interventions across multiple studies, or when trying to determine the effect that subtle changes in instruction might have on learning outcomes, precise descriptions of instruction are critical.

Fidelity of implementation is described and assessed in terms of surface (the expected intervention is implemented) and quality (how well the intervention is implemented) features. Fidelity of implementation (also known as treatment fidelity or treatment integrity) refers to the extent to which an intervention is implemented as intended (Gresham, MacMillan, Beebe-Frankenberger, & Bocian, 2000). Information about treatment fidelity is essential in understanding the relationship between an intervention (i.e., independent variable) and outcome measures (i.e., dependent variables). The goal of experimental research in special education is to demonstrate that any changes in a dependent variable are the direct result of implementing a specified intervention. Without evidence about whether the intervention was actually implemented as planned, however, it is impossible to establish this relationship unequivocally. This indicator is concerned with both whether treatment fidelity was measured and how it was measured. Although necessary, it is not sufficient to note only the number of days or sessions that the intervention was conducted. At the least, researchers should observe the intervention using a checklist of treatment components and record whether the most central aspects of the intervention occurred. A fidelity score can then be derived by calculating the percentage of treatment components that were implemented. Observations should take place over the entire course of the study on a regular basis. The research team should include some measure of interobserver reliability. Key features to look for in assessing this criterion are: (a) inclusion of key features of the practice
(e.g., ensuring students use all five steps in the strategy, ensuring students are asked to draw a visual representation of a mathematical problem, ensuring teachers provide the specified number of models of a cognitive strategy), (b) adequate time allocation per day or week for the intervention, and (c) coverage of a specified amount of material in the curriculum or teacher guide (when appropriate). Although we label this “surface” fidelity, it is an essential part of any study and thus extremely important. Some interventions, such as home visiting interventions, may not be as “visible” but still require fidelity assessments. Videotapes of visits may need to be obtained and scored. Occasionally, interventions will theorize a causal chain that requires assessment of intermediate links in the chain. For example, a home visiting intervention may theorize that the home visitors working with parents in a certain way will change the way parents work with their child, which in turn will influence the child’s behavior. In this type of situation, assessment of the causal link, the way the parents work with their child, is important to be sure that the intervention worked as theorized.

More sophisticated measures of fidelity require an observer to document not only the occurrence of prescribed steps in the process, but also that the interventionist employed the techniques with quality. For example, an observer can evaluate whether teachers model skills and strategies using clear language and interesting examples, or whether they scaffold or provide corrective feedback consistently. Investigating treatment fidelity at this level provides a deeper understanding of implementation issues and can lead to important insights about intervention components and teacher behaviors that are more directly related to desired outcomes. It also can be a lens into how well interventionists understand the principles behind the concept. This level always requires some degree of inference. Observational field notes or audiotapes of selected sessions can be used to gain an understanding of quality. Often, selected use of transcripts can enrich a journal article or research report by giving the reader a sense of how the intervention plays out with students and actual curricular materials (e.g., Fuchs & Fuchs, 1998; Palincsar & Brown, 1984). Finally, it is important to assess treatment fidelity in a way that allows researchers to ascertain whether the level of implementation was constant or whether fidelity varied or fluctuated within and across interventionists. Researchers should carry out this type of analysis both for individual interventionists over time, to determine if implementation was consistent across the duration of the intervention, and between interventionists, to determine if some teachers delivered the treatment with greater integrity than others. These data can serve numerous other purposes. If researchers find a reasonable amount of variability, as will often be the case in studies that use quality-of-implementation measures, they can correlate implementation scores with outcomes or use contrasted-groups analysis to see if quality or intensity of implementation relates to outcome measures. They also can use the profiles to get a sense of typical implementation problems. These data can be used to develop or refine professional development activities.
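As one concrete illustration of these points, the sketch below derives a “surface” fidelity score as the percentage of checklist components observed in each session and summarizes its variability per interventionist. It is our illustration under assumed data; the component names, observation structure, and values are hypothetical rather than drawn from the article.

```python
# A minimal sketch (ours) of a surface fidelity score: the percentage of
# prescribed treatment components observed per session, summarized per
# interventionist so that consistency over time and between interventionists
# can be examined. Component names and observations are hypothetical.
from statistics import mean, stdev

components = ["models_strategy", "guided_practice", "corrective_feedback",
              "independent_practice", "progress_check"]

# Each observation is a checklist: True = component observed in that session.
observations = {
    "teacher_A": [[True, True, True, False, True], [True, True, True, True, True]],
    "teacher_B": [[True, False, False, True, False], [True, True, False, True, False]],
}

for teacher, checklists in observations.items():
    scores = [100 * sum(c) / len(components) for c in checklists]
    spread = stdev(scores) if len(scores) > 1 else 0.0
    print(f"{teacher}: mean fidelity = {mean(scores):.0f}%, SD across sessions = {spread:.1f}")
```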
The nature of services provided in comparison conditions is described and documented. One of the least glamorous and most neglected aspects of research is describing and assessing the nature of the instruction in the comparison group (Gersten, Baker, et al., 2000). Yet, to understand what an obtained effect means, one must understand what happened in the comparison classrooms. This is why researchers should also describe, assess, and document implementation in comparison groups. At a minimum, researchers should examine comparison groups to determine what instructional events are occurring, what texts are being used, and what professional development and support is provided to teachers. Other factors to assess include possible access to the curriculum/content associated with the experimental group’s intervention, time allocated for instruction, and type of grouping used during instruction (Elbaum, Vaughn, Hughes, & Moody, 1999). In some studies, assessment of the comparison condition may be similar to what occurs for treatment fidelity. Every study will vary based on specific research issues.

Outcome Measures

Multiple measures are used to provide an appropriate balance between measures closely aligned with the intervention and measures of generalized performance. Far too often, the weakest part of an intervention study is the quality of the measures used to evaluate the impact of the intervention. Intervention researchers often spend more time on aspects of the intervention related to instructional procedures than on dependent measures. However, using tests of unknown validity invariably weakens the power of a study. Thus, a good deal of the researcher’s effort should be devoted to the selection and development of dependent measures. In essence, the conclusion of a study depends not only on the quality of the intervention and the nature of the comparison groups, but also on the quality of the measures selected or developed to evaluate intervention effects. One of the foremost challenges in crafting a study is selection of measures that are well aligned with the substance of the intervention and yet are sufficiently broad and robust to (a) avoid criticism for “teaching to the test” through the specific intervention, and (b) demonstrate that generalizable skills have been successfully taught. An intervention may have adverse effects or additional benefits that a researcher should attempt to identify by using measures of generalized performance. Multiple measures should be used in a study, as no measure is perfect, and no measure can assess all or even most of the important aspects of performance that an intervention might affect. If a study addresses more than one construct (e.g., fluency with arithmetic facts and problem-solving ability), it can be valuable to include multiple tools to measure each facet of performance. Finally, in special education research that includes a wide range of different types of students, it is critical that measures are able to adequately assess the performance of students with different disabilities. This may sometimes involve use of accommodations or use of alternate formats such as structured interviews on the content rather than written essays (see, for example, Bottge, Heinrichs, Mehta, & Ya-Hui, 2002; Ferretti, MacArthur, & Okolo, 2001; Gersten, Baker, Smith-Johnson, Peterson, & Dimino, in press).
Evidence of reliability and validity for the outcome measures is provided. Estimation of internal consistency reliability (often referred to by technical names such as coefficient alpha or Cronbach’s alpha) is a critical aspect of a research study; however, it is often neglected. This omission is difficult to understand, since coefficient alpha is easy to compute with common statistical packages and the information is very important. Internal consistency reliability helps us to understand how well a cluster of items on a test “fit” together, and how well performance on one item predicts performance on another. These analyses help researchers locate items that do not fit well, or items with a weak item-to-total correlation. Revising these items or dropping them from the measure increases the reliability and thus increases the power of a study. The bottom line for reliability is a coefficient alpha of at least .60. This may seem low; however, measurement and design experts such as Nunnally and Bernstein (1994) and Shadish et al. (2002) indicate that this level of reliability is acceptable for a newly developed measure or a measure in a new field for two reasons. First, in experimental research, as opposed to individual assessment, we are concerned with the error associated with the mean score. The standard error of measurement is appreciably less for a sample mean than for an individual test score. Second, internal consistency of .60 or higher indicates that a coherent construct is being measured. An ideal mix of outcome measures in a study would include psychometrically sound measures with a long track record, such as the Woodcock-Johnson or Walker-McConnell, with coefficient alpha reliabilities over .80, and newer measures with coefficient alpha reliability of .60 or so. Although we strongly encourage inclusion of data on test-retest reliability, an intervention study is typically considered acceptable without such data at the current point in time. At the same time, some indication of concurrent validity is also essential. For newly developed measures, using the data generated in the study would be acceptable, as would data from a study of an independent sample. Yet, to be highly acceptable, the researcher should independently conduct some type of concurrent validity study. Concurrent validity becomes even more critical when using measures for groups other than those for which the test was designed (e.g., using a measure translated into Spanish for use with Spanish-speaking bilingual students). For studies to rank in the highly acceptable category, empirical data on the predictive validity of measures used and any information on construct validity should be reported.
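Returning to internal consistency: coefficient alpha is indeed straightforward to compute. The following minimal sketch applies the standard formula to hypothetical item scores (not data from any study cited here) and also reports corrected item-total correlations for locating items that do not fit well.

```python
# A minimal sketch (ours, not the article's) of coefficient (Cronbach's) alpha:
# alpha = k/(k-1) * (1 - sum(item variances) / variance of total score).
import numpy as np

# rows = examinees, columns = items (hypothetical 0-3 ratings on a 5-item measure)
scores = np.array([
    [2, 3, 2, 3, 2],
    [1, 1, 0, 1, 1],
    [3, 3, 2, 3, 3],
    [0, 1, 1, 0, 1],
    [2, 2, 3, 2, 2],
])

k = scores.shape[1]
item_variances = scores.var(axis=0, ddof=1)
total_variance = scores.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Corrected item-total correlations help locate items that do not "fit" the measure.
totals = scores.sum(axis=1)
item_total_r = [np.corrcoef(scores[:, j], totals - scores[:, j])[0, 1] for j in range(k)]

print(f"coefficient alpha = {alpha:.2f}")
print("corrected item-total correlations:", np.round(item_total_r, 2))
```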
Outcomes for capturing the intervention’s effect are measured at the appropriate times. The goal of measurement activities in experimental research is to detect changes in a dependent variable that are the result of implementing an independent variable. Depending on the nature of the independent variable and a study’s research questions, there may be critical periods of time in which intervention effects can be captured through data collection activities. Many times, the effects of an intervention are best detected immediately. For example, if researchers are interested in evaluating the initial effectiveness of a four-week intervention designed to introduce a strategy for deriving word meanings from context, data should be collected within a few days of the end of the intervention. If data collection is delayed, initial intervention effects may fade, or subsequent instruction or other uncontrolled events may contaminate results. Sometimes, however, important intervention effects cannot be detected immediately. Information about the delayed or long-term effects of interventions is extremely important but often not addressed or measured in research studies. For example, researchers may be interested in determining whether students are still able to apply the strategy for deriving word meanings six weeks after the intervention ended. In cases such as this, researchers should time data collection activities to align with research questions. For some studies, it may be appropriate to collect data at only pre- and posttest. In many cases, however, researchers should consider collecting data at multiple points across the course of the study, including follow-up measures. Multiple data points allow researchers to apply more complex statistical procedures such as hierarchical linear modeling or growth curve analyses. These analyses allow for an examination of individual student trajectories as well as overall group effects. These data can provide researchers with a more nuanced, complex picture of student growth as well as information about immediate and long-term intervention effects. Thus, we would consider these indicators of exemplary research designs, provided the array of measures makes sense developmentally.

Data collectors and/or scorers are blind to study conditions and equally (un)familiar to examinees across study conditions. Data collection activities may inadvertently affect a study’s findings. One way this can happen is if data collectors and/or scorers are aware of which students took part in the treatment and control conditions. Experimenter bias may result from this situation if researchers’ expectations unintentionally influence assessment or scoring procedures. Study findings can also be affected if some participants are familiar with data collectors while other participants are not. In this case, participants may feel more comfortable with individuals they know and therefore perform better during testing. Researchers should design and carry out data collection activities in a way that minimizes these threats to the study’s internal validity. This is optimal for any study, although there will be times when it is impossible to implement. In all cases, however, data collectors should be trained, and inter-rater/observer/tester reliabilities should be conducted.

Adequate inter-scorer agreement is documented. Although researchers must select outcome measures that align with the purpose of the study and that possess acceptable levels of reliability and validity, the quality of the data from these measures is only as good as the corresponding data collection and scoring procedures. To ensure the interpretability and integrity of study data, researchers should also consider carefully how data are collected and scored. Researchers should ensure that test administration and scoring are consistent and reliable across all data collectors and scorers. Documentation of agreement is especially important when using newly developed measures or when scoring involves subjective decisions. Inter-scorer agreement can be evaluated by having multiple data collectors administer assessments to a random sample of study participants or by observing selected testing sessions. Similar procedures can be used to check agreement in data scoring.
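A minimal sketch of such a check, assuming a hypothetical double-scored subsample, is shown below: it reports exact percent agreement and the correlation between two scorers’ scores. The data and variable names are illustrative only, not drawn from the article.

```python
# A minimal sketch (ours) of checking inter-scorer agreement on a double-scored
# random subsample: exact percent agreement plus the correlation between scorers.
import numpy as np

scorer_1 = np.array([3, 2, 4, 4, 1, 3, 2, 4, 3, 2])  # hypothetical scores
scorer_2 = np.array([3, 2, 4, 3, 1, 3, 2, 4, 3, 1])

exact_agreement = np.mean(scorer_1 == scorer_2)            # proportion of identical scores
score_correlation = np.corrcoef(scorer_1, scorer_2)[0, 1]  # consistency of rank ordering

print(f"exact agreement = {exact_agreement:.2f}")
print(f"correlation between scorers = {score_correlation:.2f}")
```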
Ideally, reliability in data collection and scoring should be above .90.

Data Analysis

We highlight several key themes in this set of indicators. The linkage of data analysis and unit of analysis to key research questions is important. In addition, we emphasize that the researcher should use appropriate statistical techniques to adjust for any pretest differences on salient predictor variables. A key issue is ensuring that data analyses and research questions are aligned with the appropriate unit of analysis for a given research question. Researchers should explicitly state which unit was used in the statistical analysis of intervention effects. For example, in one type of analysis, individual students may be the unit of analysis; in another case, where students are taught in pairs, the pair may well be the appropriate unit of analysis. The determination of the appropriate unit of analysis relates directly to the research hypotheses/questions, the sample, and the assignment of the sample to experimental conditions. When students are the unit, the n equals the number of students in the analysis and each student contributes his or her score. When classrooms are the units, the n equals the number of classrooms, and individual students’ scores are averaged to reflect a class value. When a full class is a unit, it may well be advisable to assess all of the special education students but assess only a stratified random sample of the non-disabled students. It is also possible that, within the same study, the teacher may be the appropriate unit of analysis for some measures (e.g., teacher knowledge of inclusive practices), and the individual student may be the best unit of analysis for others (e.g., degree of engagement in academic activities with peers). When appropriate, we urge the use of multilevel analyses because they are uniquely designed to consider multiple units within a single analysis.

The quality indicators related to data-analytic strategies and study design are essential. However, it would require, at a minimum, a full issue of this journal to begin to do justice to the topic. Choice of an appropriate data-analytic strategy has always been partly art and partly science. Usually, numerous strategies are appropriate, some more elegant and sophisticated than others. In the past decade, there have been enormous advances in this area, with increased understanding and wide use of structural equation modeling and multilevel modeling. Thus, the array of options is large, and determining a strategy or set of strategies that is sensitive and powerful enough to detect treatment effects, while also sophisticated enough to look at important secondary patterns or themes in the data, is a complex task. In this brief section, we highlight a few of the key themes that both those designing studies and those evaluating proposals or research reports need to keep in mind when considering data analyses.

The data analysis techniques chosen are appropriate and linked in an integral fashion to key research questions and hypotheses. Statistical analyses are accompanied by presentation of effect sizes. Merely cataloguing every possible analysis that can be done is unacceptable. A brief rationale for major analyses and for selected secondary analyses is critical. The variability within each sample is accounted for either by sampling or by statistical techniques such as analysis of covariance.
Post-intervention procedures are used to adjust for differences among groups of more than 0.25 of a standard deviation on salient pretest measures. Planned comparisons are used when appropriate. The researcher should clearly link the unit of analysis chosen to the key statistical analyses. Support for the unit(s) of analysis relative to the conceptual and practical foci of the treatment is provided. Multilevel analysis is used when appropriate. Although data can always be analyzed at multiple levels, the reader of a proposal or report should clearly know the levels for which the analysis will be most powerful and which unit of analysis is most appropriate for a specific research question and design. Appropriate secondary analyses are conducted (e.g., testing for the effectiveness of the treatment within important subgroups of participants or settings).

A power analysis is provided to describe the adequacy of the minimum cell size. A power analysis is conducted for each unit of analysis to be examined (e.g., school and class as well as student). When proposing a study, researchers should provide a power analysis for the most important analyses and indicate, either explicitly or implicitly, how they address the major factors that influence the power of a research design. In other words, they need to indicate whether they anticipate a small, medium, or large effect and how they have addressed issues of sample heterogeneity/variability in their design and analysis strategies. The power analysis should indicate that the design has an adequate sample size to detect hypothesized effects. The researchers should also indicate the impact of waves of assessments on the power of their analyses.

Determining When a Practice is Evidence-Based: A Modest Proposal

Currently, there is a great deal of deliberation and discussion on what it means to call an educational practice or special education intervention “evidence-based.” Particularly controversial issues are the relative weighting of randomized trials versus quasi-experiments and the extent to which we can generalize findings across various subgroups of students with disabilities. Another key issue is how to make the determination that an evidence-based practice is implemented with such low quality that we could no longer assert that it is likely to enhance learner outcomes. The authors of this article were not in complete agreement on any of these issues. However, we decided, with some trepidation, to present a modest proposal: a criterion for concluding that a practice is evidence-based. In reaching this determination, we were heavily influenced by the research and scholarship on research synthesis and meta-analysis over the past twenty years (e.g., Cooper & Hedges, 1994) and by the research syntheses conducted in our field (e.g., Gersten, Schiller, & Vaughn, 2000; Swanson & Hoskyn, 1998). We propose that a practice be considered evidence-based when:

• There are at least four acceptable quality studies, or two high quality studies, that support the practice; and
• The weighted effect size is significantly greater than zero.
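To make the second criterion concrete, the sketch below computes a fixed-effect, inverse-variance weighted mean effect size across a set of studies and tests whether it exceeds zero. This is our illustration of the general meta-analytic procedure, not a calculation specified in this article; the effect sizes, sample sizes, and the variance approximation for d are assumptions.

```python
# A minimal sketch (ours) of an inverse-variance weighted mean effect size
# across supporting studies, with a test of whether it is greater than zero.
import math

# (Cohen's d, n_treatment, n_comparison) for each supporting study (hypothetical)
studies = [(0.45, 30, 28), (0.62, 25, 25), (0.38, 40, 42), (0.55, 22, 20)]

weights, weighted_ds = [], []
for d, n1, n2 in studies:
    variance = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))  # approx. variance of d
    weights.append(1 / variance)
    weighted_ds.append(d / variance)

mean_effect = sum(weighted_ds) / sum(weights)
se = math.sqrt(1 / sum(weights))
z = mean_effect / se
print(f"weighted effect size = {mean_effect:.2f}, SE = {se:.2f}, z = {z:.2f}")
# z > 1.645 would indicate that the weighted effect size is significantly
# greater than zero at the one-tailed .05 level.
```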
Again, to be considered “acceptable,” a study would need to meet all but one of the “Essential Quality Indicators” specified in Table 2 and demonstrate at least one of the quality indicators listed as “Desirable.” To be considered “high quality,” a study would need to meet all but one of the “Essential Quality Indicators” specified in Table 2 and demonstrate at least four of the quality indicators listed as “Desirable.”

The reason we stressed the weighted effect size is that this statistic takes into account (a) the number of studies conducted on the specific intervention, (b) the number of participants in the studies, and (c) the magnitude and consistency of effects. To us, these three components seem to be most critical for making determinations as to whether a practice is evidence-based. We propose the following criteria for considering a practice as promising:

• There are at least four acceptable quality studies, or two high quality studies, that support the practice; and
• The 20% confidence interval for the weighted effect size is greater than zero.

To reach these recommendations, we weighed the quality of the research design (e.g., Feuer et al., 2002; Shadish et al., 2002) heavily. There are several reasons for doing this. Recently, Simmerman and Swanson (2001) documented the impact of specific flaws in research design on a study’s effect size. They argued: “Among the factors that determine a study’s credibility, it is often the internal and external validity variables that establish research quality and have significant effects on treatment outcomes” (p. 211). To test this hypothesis, Simmerman and Swanson examined the effects of a large set of internal and external validity variables on treatment outcomes for students with learning disabilities (LD), using studies identified in the Swanson and Hoskyn (1998) meta-analysis on treatment outcomes and LD. Results indicated that the following factors lead to lower effect sizes: controlling for teacher effects, using standardized rather than just experimenter-developed measures, using the appropriate unit in data analyses, reporting the sample’s ethnic composition, providing psychometric information, and using multiple criteria to define the sample. Simmerman and Swanson’s (2001) results indicate that better controlled studies appear to be less biased in favor of the intervention. It is interesting to note that use of random assignment (versus use of quasi-experimental designs) approached, but did not reach, statistical significance (p = .0698). Thus, issues addressed by these quality indicators appear to influence the inferences we make about effective approaches for special education. Better-controlled studies (in terms of using measures of documented reliability, controlling for effects of teachers or interventionists, and using the appropriate unit of analysis in statistical calculations) tend to be less biased. Use of standardized measures of broad performance leads to weaker estimates of the effectiveness of an intervention. The key lesson we can learn from the Simmerman and Swanson analysis is that research quality does matter and does have educational implications. We feel it is important to note that conducting quality research is more expensive than conducting experiments that are compromised at various levels. We see signs of awareness of this fact in both the National Research Council (2002) report on educational research and in some current Federal initiatives.
Routinely conducting quality research in public schools will also require a shift in the culture of schools, much as it required a shift in the culture of medical clinics and hospitals fifty years ago, and of public welfare programs two decades ago. Active support by the U.S. Department of Education will, in our view, be required. It would seem to go hand in hand with the current emphasis on scientifically based research.

This set of criteria is merely a first step. We envision the following as critical next steps:

• Field testing of this system of indicators by competent individuals as they review grant applications and other research proposals and manuscripts submitted to journals for publication.
• Refinements based on field testing in both the areas of proposal review and of research synthesis.
• Consideration for adoption by journals in our field and/or funding agencies such as the Institute of Education Sciences and the Office of Special Education Programs.
• Serious field testing of the quality indicators’ impact on evidence-based practice. The issue of integrating findings from different types of research (e.g., correlational, single-subject design, qualitative) needs to be considered as part of this field-testing effort.

This would seem to be a reasonable effort for CEC’s Division for Research to undertake in the near future in conjunction with relevant Federal agencies, other divisions of CEC, and, potentially, agencies that review research grant applications relating to special education.

References

Boruch, R. F. (1997). Randomized experiments for planning and evaluation: A practical guide. Thousand Oaks, CA: Sage Publications.

Bottge, B. A., Heinrichs, M., Mehta, Z. D., & Ya-Hui, H. (2002). Weighing the benefits of anchored math instruction for students with disabilities in general education classes. Journal of Special Education, 35, 186.

Campbell, D. T., & Stanley, J. C. (1966). Experimental and quasi-experimental designs for research. Boston, MA: Houghton Mifflin.

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd ed.). Hillsdale, NJ: Erlbaum.

Cooper, H., & Hedges, L. V. (Eds.). (1994). Handbook of research synthesis. New York: Russell Sage Foundation.

Elbaum, B., Vaughn, S., Hughes, M., & Moody, S. W. (1999). Grouping practices and reading outcomes for students with disabilities. Exceptional Children, 65, 399-415.

Fabes, R. A., Martin, C. L., Hanish, L. D., & Updegraff, K. A. (2000). Criteria for evaluating the significance of developmental research in the twenty-first century: Force and counterforce. Child Development, 71, 212-221.

Ferretti, R. P., MacArthur, C. D., & Okolo, C. M. (2001). Teaching for historical understanding in inclusive classrooms. Learning Disability Quarterly, 24, 59.

Feuer, M. J., Towne, L., & Shavelson, R. J. (2002). Scientific culture and educational research. Educational Researcher, 31, 4-14.

Forness, S., Kavale, K. A., Blum, I. M., & Lloyd, J. W. (1997). Mega-analysis of meta-analyses: What works in special education and related services. Teaching Exceptional Children, 29, 4-9.

Fuchs, D., & Fuchs, L. S. (1998). Researchers and teachers working together to adapt instruction for diverse learners. Learning Disabilities Research & Practice, 13, 126-137.

Gall, M. D., Borg, W. R., & Gall, J. P. (2002). Educational research: An introduction (7th ed.). White Plains, NY: Pearson Allyn & Bacon.
Gersten, R., Baker, S., & Lloyd, J. W. (2000). Designing high-quality research in special education: Group experimental design. Journal of Special Education, 34, 2-18.

Gersten, R., Baker, S., Smith-Johnson, J., Peterson, A., & Dimino, J. (in press). Eyes on the prize: Teaching history to students with learning disabilities in inclusive settings.

Gersten, R., Schiller, E. P., & Vaughn, S. (Eds.). (2000). Contemporary special education research: Syntheses of the knowledge base on critical instructional issues. Hillsdale, NJ: Lawrence Erlbaum Associates.

Gresham, F. M., MacMillan, D. L., Beebe-Frankenberger, M. E., & Bocian, K. M. (2000). Treatment integrity in learning disabilities intervention research: Do we really know how treatments are implemented? Learning Disabilities Research & Practice, 15, 198-205.

National Research Council. (2002). Scientific research in education. R. J. Shavelson & L. Towne (Eds.), Committee on Scientific Principles for Educational Research. Washington, DC: National Academy Press.

Nunnally, J. C., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). New York: McGraw-Hill.

Palincsar, A. S., & Brown, A. L. (1984). Reciprocal teaching of comprehension-fostering and comprehension-monitoring activities. Cognition and Instruction, 1, 117-175.

Shadish, W. R., Cook, T. D., & Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Boston: Houghton Mifflin.

Simmerman, S., & Swanson, H. L. (2001). Treatment outcomes for students with learning disabilities: How important are internal and external validity? Journal of Learning Disabilities, 34, 221-236.

Swanson, H. L., & Hoskyn, M. (1998). Experimental intervention research on students with learning disabilities: A meta-analysis of treatment outcomes. Review of Educational Research, 68, 277-321.

Valentine, J. C., & Cooper, H. (2003). What Works Clearinghouse Study Design and Implementation Assessment Device (Version 1.0). Washington, DC: U.S. Department of Education.

Walberg, H. J. (1986). Synthesis of research on teaching. In M. C. Wittrock (Ed.), Handbook of research on teaching (3rd ed., pp. 214-229). New York, NY: Macmillan.

Author Note

We wish to acknowledge the excellent editorial feedback provided by Susan Marks, Madhavi Jayanthi, Jonathan R. Flojo, and Michelle Spearman.

Footnote

Statistical power denotes the probability that a statistical test will accurately detect statistical significance when an intervention is, in fact, effective for the population targeted. Factors traditionally identified as determining statistical power include sample size, anticipated effect size, and the type of statistical test selected. Given the increasing use of multilevel models, we would like to add the number of assessment waves as an additional factor affecting statistical power.

Under conditions of study economy (as is often the case in special education intervention research), the appropriate number of units needed in the design is determined by the effect size one “expects” to detect (small vs. medium vs. large). Cohen’s (1988) traditional breakdown of effect size ranges is: .2 = small, .5 = moderate, and .8 and greater = large. Effect sizes in the .40 or larger range are often considered minimum levels for “educational” or “clinical” significance (Forness, Kavale, Blum, & Lloyd, 1997; Walberg, 1986).
Experienced special education investigators determine the number of participant units needed in a study using effect sizes determined from pilot data that have used the intervention and relevant outcome variables. Alternatively, one may obtain “expected” effect sizes from related published research for the purpose of conducting a power analysis and planning a new study. Power analyses should always be conducted to help make decisions concerning the minimal sample size necessary and the number of comparison conditions that are truly viable. In the early stages of research on a topic, it makes sense to design a study conservatively, with a large enough sample to increase the chance of detecting small effects. Conducting a study in two or even three waves or cycles often makes sense in these instances.

In some cases, the effects a study is capable of detecting may be so small that they are not interesting, or the costs of designing a study capable of detecting them may simply not be worth it. Thus, smaller participant units can be used. The point is that special education researchers can conduct studies that are correctly powered relative to available resources, student population sizes, and the goals of the research when hypothesizing moderate to large effects and keeping the number of experimental conditions small (i.e., a two-group rather than a three-group design). Replicating a series of studies with smaller samples powered to detect moderate to large effects appears to offer a powerful approach to the discovery of causal relations in small populations/samples.
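The sample-size reasoning above can be illustrated with a short a priori power analysis. The sketch below uses the statsmodels package and Cohen’s conventional benchmarks to solve for the per-group n needed in a two-group design; it is our illustration, not a calculation reported in the article.

```python
# A minimal sketch (ours) of an a priori power analysis for a two-group design:
# solve for the per-group sample size needed to detect small, moderate, and
# large hypothesized effects with power = .80 at a two-tailed alpha of .05.
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.2, 0.5, 0.8):  # Cohen's (1988) small, moderate, and large benchmarks
    n_per_group = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"d = {d}: roughly {math.ceil(n_per_group)} participants per group")
```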
